4 research outputs found

    Machine learning ensemble method for discovering knowledge from big data

    Get PDF
    Big data, generated from various business internet and social media activities, has become a big challenge to researchers in the field of machine learning and data mining to develop new methods and techniques for analysing big data effectively and efficiently. Ensemble methods represent an attractive approach in dealing with the problem of mining large datasets because of their accuracy and ability of utilizing the divide-and-conquer mechanism in parallel computing environments. This research proposes a machine learning ensemble framework and implements it in a high performance computing environment. This research begins by identifying and categorising the effects of partitioned data subset size on ensemble accuracy when dealing with very large training datasets. Then an algorithm is developed to ascertain the patterns of the relationship between ensemble accuracy and the size of partitioned data subsets. The research concludes with the development of a selective modelling algorithm, which is an efficient alternative to static model selection methods for big datasets. The results show that maximising the size of partitioned data subsets does not necessarily improve the performance of an ensemble of classifiers that deal with large datasets. Identifying the patterns exhibited by the relationship between ensemble accuracy and partitioned data subset size facilitates the determination of the best subset size for partitioning huge training datasets. Finally, traditional model selection is inefficient in cases wherein large datasets are involved

    Heterogeneous Ensemble for Imaginary Scene Classification

    Get PDF
    In data mining, identifying the best individual technique to achieve very reliable and accurate classification has always been considered as an important but non-trivial task. This paper presents a novel approach - heterogeneous ensemble technique, to avoid the task and also to increase the accuracy of classification. It combines the models that are generated by using methodologically different learning algorithms and selected with different rules of utilizing both accuracy of individual modules and also diversity among the models. The key strategy is to select the most accurate model among all the generated models as the core model, and then select a number of models that are more diverse from the most accurate model to build the heterogeneous ensemble. The framework of the proposed approach has been implemented and tested on a real-world data to classify imaginary scenes. The results show our approach outperforms other the state of the art methods, including Bayesian network, SVM an d AdaBoost

    An Algorithm for Identifying the Learning Patterns in Big Data

    No full text
    Divide-and-Conquer is probably the most commonly used strategy to deal with a big data that is too big to be loaded into any computing systems memory as a whole for analysis. It partitions such a big dataset into many smaller subsets that can be loaded into computer memory separately to induce models, which can be combined by machine learning ensemble methods. However, it is not clear that how the size of subsets may affect the learning performance of individual models and their ensemble. This paper proposes an ensemble based algorithm to quickly detect their relational patterns in terms of ensemble accuracy and the size of partitioned data subset. An ensemble framework of the algorithm is implemented and tested on 12 relatively big benchmark datasets. The experimental results indicate that it is able to identify the relation patterns accurately and efficiently in less than 10 steps. The identified patterns show that in most cases it is not necessary to use the whole big dataset for analysis as few smaller subsets are already sufficiently representative of the underlying problem, which is obviously a useful knowledge in big data analysis
    corecore